Skip to contents

#WORK IN PROGRESS

Summary

This article focuses on the spatial representation of dissimilarity scores.

TODO

Setup

library(distantia, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(mapview, warn.conflicts = FALSE)

Example Data

The examples below illustrate how to map dissimilarity scores using the datasets covid_prevalence and covid_polygons included with distantia. The dataset comprises time series of weekly Covid-19 prevalence in several California counties.

tsl <- distantia::tsl_initialize(
  x = distantia::covid_prevalence,
  name_column = "name",
  time_column = "time"
)

distantia::tsl_plot(
  tsl = tsl[1:10],
  columns = 2,
  guide = FALSE
)

The map below displays the polygons in distantia::covid_counties using mapview().

mapview::mapview(
  distantia::covid_counties,
  label = "name"
)

Dissimilarity Analysis

The code in this section prepares several datasets to use in the different mapping examples:

  • df_psi: a lock-step dissimilarity data frame with psi scores and p-values.
  • df_stats: dissimilarity stats of each time series against all others.
  • df_cluster: a hierarchical clustering based on the dissimilarity scores in df_psi.

The lock-step dissimilarity analysis shown below includes p-values from a restricted permutation test. These p-values will be useful as criteria to select relevant mapping features.

#parallelization setup
future::plan(
  future::multisession,
  workers = parallelly::availableCores() - 1
  )

#lock-step dissimilarity analysis
df_psi <- distantia::distantia(
  tsl = tsl,
  distance = "euclidean",
  lock_step = TRUE,
  repetitions = 1000,
  permutation = "restricted",
  block_size = 12 #weeks
)

#disable parallelization
future::plan(
  future::sequential
  )

#check resulting data frame
df_psi |> 
  dplyr::select(x, y, psi, p_value) |> 
  dplyr::glimpse()
#> Rows: 630
#> Columns: 4
#> $ x       <chr> "Napa", "Alameda", "Alameda", "Sacramento", "San_Joaquin", "Sa…
#> $ y       <chr> "Solano", "San_Mateo", "Contra_Costa", "Sonoma", "Stanislaus",…
#> $ psi     <dbl> 0.8726115, 1.0656371, 1.1620553, 1.2578125, 1.2919255, 1.29793…
#> $ p_value <dbl> 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001,…

The code below aggregates psi scores in the data frame df_psi to summarize the overall dissimilarity of each time series with all others.

df_stats <- distantia::distantia_stats(
  df = df_psi
)
#> Loading required package: foreach
#> Loading required package: future

df_stats |> 
  dplyr::select(
    name, mean
  ) |> 
  dplyr::glimpse()
#> Rows: 36
#> Columns: 2
#> $ name <chr> "Alameda", "Butte", "Contra_Costa", "El_Dorado", "Fresno", "Humbo…
#> $ mean <dbl> 3.214531, 3.266749, 3.358524, 3.546757, 3.270684, 4.028824, 3.834…

The function distantia_cluster_hclust() runs a hierarchical clustering using the psi scores in df_psi as clustering criteria.

df_cluster <- distantia::distantia_cluster_hclust(
  df = df_psi
)$df

df_cluster |> 
  dplyr::select(name, cluster) |>
  dplyr::glimpse()
#> Rows: 36
#> Columns: 2
#> $ name    <chr> "Napa", "Alameda", "Sacramento", "San_Joaquin", "Santa_Clara",…
#> $ cluster <dbl> 1, 2, 2, 1, 3, 2, 2, 3, 2, 1, 1, 1, 2, 1, 2, 4, 2, 1, 1, 1, 2,…

Network Map

The function distantia_spatial_network() transforms the result of distantia() to an sf data frame with edges connecting time series coordinates or polygons. The result can be interpreted as a dissimilarity network.

sf_network <- distantia::distantia_spatial_network(
  df = df_psi,
  sf = distantia::covid_counties |> 
    dplyr::select(
      name, geometry
    )
)

dplyr::glimpse(sf_network)
#> Rows: 630
#> Columns: 9
#> $ edge_name <chr> "Napa - Solano", "Alameda - San_Mateo", "Alameda - Contra_Co…
#> $ y         <chr> "Solano", "San_Mateo", "Contra_Costa", "Sonoma", "Stanislaus…
#> $ x         <chr> "Napa", "Alameda", "Alameda", "Sacramento", "San_Joaquin", "…
#> $ psi       <dbl> 0.8726115, 1.0656371, 1.1620553, 1.2578125, 1.2919255, 1.297…
#> $ p_value   <dbl> 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.00…
#> $ null_mean <dbl> 2.610115, 2.532170, 2.406356, 2.742062, 2.882596, 2.857959, …
#> $ null_sd   <dbl> 0.2271616, 0.2759364, 0.2829994, 0.3067348, 0.2033813, 0.235…
#> $ geometry  <LINESTRING [°]> LINESTRING (-122.3298 38.50..., LINESTRING (-121.…
#> $ length    <dbl> 43101.68, 46400.19, 30278.10, 135022.90, 48050.18, 33550.83,…

The resulting sf data frame has a field with the edge name, the columns x and y with the names of the connected time series, the psi scores and p-values of the dissimilarity data frame, a geometry column of type LINESTRING defining the network edges, and the length of the edges.

This sf data frame can be mapped right away, but in this case there are too many pairs of counties to achieve a meaningful map.

mapview::mapview(
  sf_network,
  layer.name = "Psi",
  label = "edge_name",
  zcol = "psi",
  color = distantia::color_continuous()
)

Focusing on particular aspects of the data at hand may help untangle this mess. For example, the code below subsets edges connecting with San Francisco and its most similar counties in terms of Covid19 prevalence.

#select pairs of counties
counties <- c(
  "Los_Angeles", 
  "San_Francisco", 
  "Fresno", 
  "San_Joaquin",
  "Monterey"
  )

sf_network_subset <- sf_network[
  which(
    sf_network$x %in% counties & 
      sf_network$y %in% counties
    ), 
  ]

#map country polygons and dissimilarity edges
mapview::mapview(
  covid_counties,
  col.regions = NA,
  alpha.regions = 0,
  color = "black",
  label = "name",
  legend = FALSE,
  map.type = "OpenStreetMap"
) +
  mapview::mapview(
    sf_network_subset,
    layer.name = "Psi",
    label = "edge_name",
    zcol = "psi",
    lwd = 5,
    color = distantia::color_continuous(
      rev = TRUE
      )
  )

The function distantia_spatial_network() also has a one-to-many mode designed to help map the dissimilarity of one time series against all others.

sf_network <- distantia::distantia_spatial_network(
  df = df_psi,
  sf = distantia::covid_counties |> 
    dplyr::select(
      name, geometry
    ),
  one_to_many = TRUE
)

dplyr::glimpse(sf_network)
#> Rows: 1,295
#> Columns: 7
#> $ x         <chr> "Alameda", "Alameda", "Alameda", "Alameda", "Alameda", "Alam…
#> $ y         <chr> "Alameda", "Butte", "Contra_Costa", "El_Dorado", "Fresno", "…
#> $ psi       <dbl> 0.000000, 2.962963, 1.162055, 2.483755, 3.456869, 3.960000, …
#> $ p_value   <dbl> 0.000, 0.050, 0.001, 0.001, 0.318, 0.975, 0.222, 0.135, 0.98…
#> $ null_mean <dbl> NA, 3.245215, 2.406356, 3.018440, 3.573495, 3.651552, 4.1563…
#> $ null_sd   <dbl> NA, 0.17327260, 0.28299936, 0.19949506, 0.23360318, 0.161934…
#> $ geometry  <MULTIPOLYGON [°]> MULTIPOLYGON (((-121.9907 3..., MULTIPOLYGON ((…

This option does not create edges, and instead uses the geometry introduced in the sf argument. This sf data frame is mapped as follows: when one site in the “x” column is selected, the result shows the dissimilarity scores of all other sites against the selected one.

#subset one county
sf_network_subset <- sf_network[
  sf_network$x == "San_Francisco", 
  ]

#one-to-many visualization
mapview::mapview(
    sf_network_subset,
    layer.name = "Psi",
    label = "y",
    zcol = "psi",
    col.regions = distantia::color_continuous(
      rev = TRUE
      )
  )

Dissimilarity Stats Map

Mapping the dissimilarity stats of each time series may help identify places that are somewhat special because they show a high dissimilarity with all others, or places that are average and have no distinctive features.

Merging the dissimilarity stats with the sf data frame containing the county polygons generates the spatial data required for the map.

sf_stats <- merge(
  x = distantia::covid_counties,
  y = df_stats
) |> 
  dplyr::select(
    mean,
    name
  )

The map below uses warm colors to highlight places with a high dissimilarity with all others, such as Humboldt county in the north west of the state.

mapview::mapview(
  sf_stats,
  layer.name = "Psi mean",
  zcol = "mean",
  label = "name",
  col.regions = distantia::color_continuous(),
  alpha.regions = 1
)

Clustering Map

This section shows how to map the similarity groups resulting from distantia::distantia_cluster_hclust() or distantia::distantia_cluster_kmeans().

The first block in the code below joins the county polygons in distantia::covid_polygons with the clustering data frame df_cluster, while creating a new variable “alpha” based on the column “silhouette_width”, which represents the strength of group membership.

The second block generates the map, using the new variable to define transparency. This setup allows to identify each group of similar counties while helping identify these counties that somehow do not fully belong. This is the case of Butte, which is in the group 5, but with a very low silhouette score.

#join county polygons with clustering groups
sf_cluster <- distantia::covid_counties |> 
  dplyr::select(name, geometry) |> 
  dplyr::inner_join(
    y = df_cluster,
    by = "name"
  ) |> 
  #remap silhouette score to transparency values
  dplyr::mutate(
    alpha = f_rescale_local(
      x = silhouette_width,
      new_min = 0.1
      )
  )

mapview::mapview(
  sf_cluster,
  layer.name = "Group",
  zcol = "cluster",
  label = "name",
  col.regions = distantia::color_continuous(
    n = max(sf_cluster$cluster)
  ),
  alpha.regions = sf_cluster$alpha
)